Evaluation DGE algorithms

All the results were filtered by the adj.p-value < 0.05 and logFC >= 0.3

## An object of class Seurat 
## 15327 features across 6961 samples within 1 assay 
## Active assay: RNA (15327 features, 0 variable features)
##  2 dimensional reductions calculated: umap, pca

Single cell methods:

For single cell methods I am going to consider both Seurat and Scanpy

Scanpy

here

For Scanpy I take in account the methods: - t-test_overestim_var - wilcoxon

Unfortunately the logistic regression provided by scanpy does not give LogFoldChanges and will not be taken in account.

The t-test_overestim_var recognized as significative almost all the genes and include almost all the genes from the wilcoxon test. Nolw let’s evaluate the expression of the genes.

Scanpy T-test

Top upregulated genes in Erythrocytes:

Top downregulated genes in Erythrocytes:

Genes with lower abs(logFC)

The result’s in terms of LogFC looks really odd. Let’s see the values for this genes:

##          Symbol     pvals_adj logfoldchanges  abs_lfc    scores
## 68         KRT1  0.000000e+00       29.73123 29.73123  48.64429
## 113   LINC00570 3.816327e-227       28.53994 28.53994  35.04078
## 131        ARG1 2.312735e-192       28.81118 28.81118  31.80785
## 13375    ARPP21 1.717660e-226      -29.04441 29.04441 -34.57604
## 13480      CD72 1.808806e-242      -28.93440 28.93440 -35.96811
## 13887       MME  0.000000e+00      -29.52419 29.52419 -44.01670
##       Aberrant.erythroid Pro.B.cells
## 68             0.4846416   0.0000000
## 113            0.3146137   0.0000000
## 131            0.2832765   0.0000000
## 13375          0.0000000   0.2891921
## 13480          0.0000000   0.3130016
## 13887          0.0000000   0.4122525

Apparently Scanpy has a bias evaluating the logFC of genes that are completely missing in one of the conditions. But The genes are correctly plotted in the report. So I will check the expression by the z-score instead of the LogFC:

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

Better result for the extremes logFC (scores), bt still not clear results for the low scored genes. Indeed those genes are expressed by less than 1% of cells:

##         Symbol  pvals_adj logfoldchanges  abs_lfc    scores Aberrant.erythroid
## 816 MIR133A1HG 0.04988015      -2.444078 2.444078 -1.980822       0.0006205399
## 817       UACA 0.04972146      -1.606474 1.606474 -1.982030       0.0009308098
## 818    CEP170B 0.04952603      -2.087353 2.087353 -1.983738       0.0003102699
##     Pro.B.cells
## 816 0.002140182
## 817 0.005082932
## 818 0.003477796
## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

Some up terms can be linked to aberrant erythroid cells, down terms slcan be linked to differentiation/prolferation properties od pro-B cells

Scanpy wilcoxon

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

Similar result with wilcoxon, slightly better GO terms

Seurat

here

For Seurat I take in account the methods: - t-test - wilcoxon - logistic regression

Similar numbers of DE genes were found by Seurat’s different flavours.

Seurat T test

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

Better result with seurat, the top upregulated genes are strictly correlated with erythrocytes. Only CD74 is downregulated and correlated with immune cells. The low abs(LFC) genes still are not well evaluable. In average the different gene expression is better appreciable

GO terms are also better compared to scanpy

Seurat wilcoxon

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

Similar to the previous

Seurat logreg

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

As above.

Now let’s see the genes fuond uniquely by logreg/wx and t-test:

Unique in wx/logeg:

## # A tibble: 1 × 8
##      p_val avg_log2FC pct.1 pct.2   p_val_adj cluster            Symbol abs_lfc
##      <dbl>      <dbl> <dbl> <dbl>       <dbl> <chr>              <chr>    <dbl>
## 1 1.55e-11      0.319 0.289 0.407 0.000000237 Aberrant erythroid NEAT1    0.319

NEAT1 result upregulated in Aberrant erythroid cells only with wilcoxon/logreg method. The logFC are quite small (0.32) and the distribution is similar between the cell lines. Looking at the expression divided by the batch, no batch specific differential expression is available.

Unique in t-test:

## # A tibble: 5 × 8
##      p_val avg_log2FC pct.1 pct.2 p_val_adj cluster            Symbol abs_lfc
##      <dbl>      <dbl> <dbl> <dbl>     <dbl> <chr>              <chr>    <dbl>
## 1 1.11e-15      0.469 0.237 0.216  1.70e-11 Aberrant erythroid RNF19A   0.469
## 2 2.33e- 7      0.432 0.24  0.284  3.58e- 3 Aberrant erythroid MGST3    0.432
## 3 3.87e-12      0.311 0.207 0.175  5.93e- 8 Aberrant erythroid ZNF451   0.311
## 4 2.10e- 6     -0.313 0.398 0.391  3.22e- 2 Aberrant erythroid ODC1     0.313
## 5 2.39e- 7     -0.520 0.483 0.457  3.67e- 3 Aberrant erythroid REXO2    0.520

All the genes doesn’t have a clear difference in the expression among the batches, I will drop the seurat t-test and keep the logreg/wx

Pseudobulk

In the pseudobulks, counts were summed for each celltype and sample, in practice obtaining a column of counts for each celltype and sample. Then, genes that weren’t showing counts in more than 90% of the cells were removed. Also, genes showing expression below 10 counts in all the pseudobulks were removed (pseudobulks are the sum of cells counts. showing less than 10 counts indicate almost all the cells didn’t had count’s for that gene).

## [1] 11505   121

DESeq2

For DESeq2, firstly I run it with the “LRT” methods that showed better results here. The PCA plot looks good but the clustering mix some samples here. I run DESeq2 specifing the model to take account of the batches.

Then, I tested combat_seq to correct the batch effect here. Also, I decided to see if there was a batch effect afrom unknown sources using the single variable analysis (sva) here.

DESeq2 without batch correction found the least number of DE genes compared to the methods tested until now. Also, the DEgenes found with batch corrected methods differs from DESeq2.

DESeq2-LRT

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

Genes with high and low logFC are well divided. Also, the genes with low abs(logDC) show better division between the cell types. BP terms aren’t really specific.

DESeq2-combat

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

Genes with high and low logFC are well divided. Also, the genes with low abs(logDC) show better division between the cell types. BP terms aren’t really specific.

BP terms aren’t really specific. read this for vescicole and autophagy.

DESeq2-sva

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

Despite both found ~4k DE genes, only a partial overlap of DE genes between DESeq2 + sva and Seurat. Also, the top lgFC genes have really poor expression.

limma

Normal and with combat

limma limma combat

Limma found more than 7k genes, no differences with the batch corrected data

Limma-voom

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

High logFC are clear, low abs(logFC) aren’t clear Read this and this for cilia

EdgeR

Finally I tested edgeR with LRT as best option from here, I tested combat too. edgeR edgeR_combat

Both the methods found more than 9k genes with some hundreds uniquel per each method.

EdgeR

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

Extreme LogFC show different results from what obtained with other algorithms, maybe because of the hand-written method. Low abs(logFC) are not appreciable. BP terms looks good

EdgeR_combat

## [1] "Top upregulated genes"

## [1] "Top downregulated genes"

## [1] "lowest abs(LFC) genes"

## [1] "Upregulated GO terms"

## [1] "Downregulated GO terms"

High number of DE genes found, the low(logFC) aren’t well distinguishable. GOOD BPs

HBA1/2 bimodal distribution

RRMM_BM_8 has a lower expression of both the subunits… Interestingly is a longitudinal sample and we can look at the expression of the genes along the samples. Also, are those cells misslabelled? or the expression of HBA1/2 is lower in all the cell types. To check it I am loading the original dataset and plot the expression of all the samples from this patient:

##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  6313873 337.2   11137894 594.9 11137894 594.9
## Vcells 11062202  84.4   58216616 444.2 72770761 555.2

Interestingly also the expression of HBA1/2 in other cell lines is lower in RR_BM_8, even looking at the other longitudinal data from the same patient